# Loading BOSTON dataset.
dataset <- read.csv("boston.csv")
head(dataset)
- LON and LAT are the longitude and latitude of the center of the census tract.
- MEDV is the median value of owner-occupied homes, measured in thousands of dollars.
- CRIM is the per capita crime rate.
- ZN is related to how much of the land is zoned for large residential properties.
- INDUS is the proportion of the area used for industry.
- CHAS is 1 if a census tract is next to the Charles River, and 0 otherwise.
- NOX is the concentration of nitrous oxides in the air, a measure of air pollution.
- RM is the average number of rooms per dwelling.
- AGE is the proportion of owner-occupied units built before 1940.
- DIS is a measure of how far the tract is from centres of employment in Boston.
- RAD is a measure of closeness to important highways.
- TAX is the property tax per $10,000 of value.
- PTRATIO is the pupil-to-teacher ratio by town.
str(dataset)
## 'data.frame': 506 obs. of 16 variables:
## $ TOWN : chr "Nahant" "Swampscott" "Swampscott" "Marblehead" ...
## $ TRACT : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
## $ LON : num -71 -71 -70.9 -70.9 -70.9 ...
## $ LAT : num 42.3 42.3 42.3 42.3 42.3 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
summary(dataset)
## TOWN TRACT LON LAT
## Length:506 Min. : 1 Min. :-71.29 Min. :42.03
## Class :character 1st Qu.:1303 1st Qu.:-71.09 1st Qu.:42.18
## Mode :character Median :3394 Median :-71.05 Median :42.22
## Mean :2700 Mean :-71.06 Mean :42.22
## 3rd Qu.:3740 3rd Qu.:-71.02 3rd Qu.:42.25
## Max. :5082 Max. :-70.81 Max. :42.38
## MEDV CRIM ZN INDUS
## Min. : 5.00 Min. : 0.00632 Min. : 0.00 Min. : 0.46
## 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19
## Median :21.20 Median : 0.25651 Median : 0.00 Median : 9.69
## Mean :22.53 Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :50.00 Max. :88.97620 Max. :100.00 Max. :27.74
## CHAS NOX RM AGE
## Min. :0.00000 Min. :0.3850 Min. :3.561 Min. : 2.90
## 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
## Median :0.00000 Median :0.5380 Median :6.208 Median : 77.50
## Mean :0.06917 Mean :0.5547 Mean :6.285 Mean : 68.57
## 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
## Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00
## DIS RAD TAX PTRATIO
## Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60
## 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40
## Median : 3.207 Median : 5.000 Median :330.0 Median :19.05
## Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46
## 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00
# Checking for missing values.
colSums(is.na(dataset))
## TOWN TRACT LON LAT MEDV CRIM ZN INDUS CHAS NOX
## 0 0 0 0 0 0 0 0 0 0
## RM AGE DIS RAD TAX PTRATIO
## 0 0 0 0 0 0
colSums(dataset=='')
## TOWN TRACT LON LAT MEDV CRIM ZN INDUS CHAS NOX
## 0 0 0 0 0 0 0 0 0 0
## RM AGE DIS RAD TAX PTRATIO
## 0 0 0 0 0 0
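The two checks above can be folded into one report. The following is a minimal sketch on a hypothetical toy data frame (`toy`) with a hypothetical helper (`count_missing`); neither comes from the original analysis. One detail worth noting: `df == ""` evaluates to `NA` wherever the cell itself is `NA`, so `colSums()` needs `na.rm = TRUE` to return a count for such columns.

```r
# Hypothetical toy data; the column names only mimic the Boston file.
toy <- data.frame(
  TOWN = c("Nahant", "", NA, "Marblehead"),
  MEDV = c(24.0, NA, 34.7, 33.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA cells and empty-string cells per column.
count_missing <- function(df) {
  na_count <- colSums(is.na(df))
  # df == "" is NA wherever the cell itself is NA, so drop those with na.rm.
  empty_count <- colSums(df == "", na.rm = TRUE)
  data.frame(na = na_count, empty = empty_count)
}

missing_report <- count_missing(toy)
print(missing_report)
```

On the real data, `count_missing(dataset)` would be expected to report zeros in both columns, in line with the output above.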
Insight
- Character: TOWN
- Integer: TRACT, CHAS, RAD, TAX
- Numeric: LON, LAT, MEDV, CRIM, ZN, INDUS, NOX, RM, AGE, DIS, PTRATIO
- Number of columns: 16
- Number of observations: 506
- The median home value (MEDV) ranges from $5,000 to $50,000.
- Longitude values lie close to -71 and latitude values close to 42, as expected for the Boston area.
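The type grouping and the dollar range listed above can be derived in code rather than read off the `str()` output by eye. This is a small sketch on a hypothetical three-row data frame (`toy`), not the Boston file itself.

```r
# Hypothetical toy data with the same kinds of columns as the Boston file.
toy <- data.frame(
  TOWN = c("Nahant", "Swampscott", "Marblehead"),
  TRACT = c(2011L, 2021L, 2031L),
  MEDV = c(24.0, 21.6, 33.4),
  stringsAsFactors = FALSE
)

# Class of each column, as a named character vector.
col_types <- sapply(toy, class)

# Group column names by class: one list element per type.
cols_by_type <- split(names(col_types), col_types)
print(cols_by_type)

# Dimensions: rows (observations) and columns (variables).
print(dim(toy))

# MEDV is stored in thousands of dollars; convert the range to dollars.
medv_range_dollars <- range(toy$MEDV) * 1000
print(medv_range_dollars)
```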
par(mfrow=c(2,3))
plot(dataset$LON, dataset$MEDV, main = "Longitude", xlab = "", ylab = "", cex.lab = 1.5, col = "#1e3799", pch = 20, cex = 0.5)
plot(dataset$LAT, dataset$MEDV, main = "Latitude", xlab = "", ylab = "", cex.lab = 1.5, col = "#628ca6", pch = 20, cex = 0.5)
plot(dataset$CRIM, dataset$MEDV, main = "Crime Rate", xlab = "", ylab = "", cex.lab = 1.5, col = "#e58e26", pch = 20, cex = 0.5)
plot(dataset$ZN, dataset$MEDV, main = "Residential Zone", xlab = "", ylab = "", cex.lab = 1.5, col = "#628ca6", pch = 20, cex = 0.5)
plot(dataset$INDUS, dataset$MEDV, main = "Industry Area", xlab = "", ylab = "", cex.lab = 1.5, col = "#e58e26", pch = 20, cex = 0.5)
plot(dataset$TAX, dataset$MEDV, main = "Tax per $10,000", xlab = "", ylab = "", cex.lab = 1.5, col = "#1e3799", pch = 20, cex = 0.5)
par(mfrow=c(2,3))
plot(dataset$NOX, dataset$MEDV, main = "Nitrous Oxides", xlab = "", ylab = "", cex.lab = 1.5, col = "#84817a", pch = 20, cex = 0.5)
plot(dataset$RM, dataset$MEDV, main = "Number of Rooms", xlab = "", ylab = "", cex.lab = 1.5, col = "#cd6133", pch = 20, cex = 0.5)
plot(dataset$CRIM, dataset$MEDV, main = "Crime Rate", xlab = "", ylab = "", cex.lab = 1.5, col = "#b33939", pch = 20, cex = 0.5)
plot(dataset$AGE, dataset$MEDV, main = "Built before 1940 (Age)", xlab = "", ylab = "", cex.lab = 1.5, col = "#b33939", pch = 20, cex = 0.5)
plot(dataset$DIS, dataset$MEDV, main = "Employment", xlab = "", ylab = "", cex.lab = 1.5, col = "#cd6133", pch = 20, cex = 0.5)
plot(dataset$RAD, dataset$MEDV, main = "Highways", xlab = "", ylab = "", cex.lab = 1.5, col = "#84817a", pch = 20, cex = 0.5)
Insight
- Tracts with a higher crime rate tend to have lower house prices.
- Nitrous oxides, which cause air pollution, are present across the whole range of house prices.
- Higher house prices tend to coincide with higher values of the employment measure (DIS). Note that DIS measures distance from centres of employment, not the number of opportunities.
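Impressions read off scatter plots can be backed with a number. One option is the Spearman rank correlation of each predictor with MEDV, which does not assume a linear relationship. The sketch below runs on simulated data (`sim`), not the Boston values; the coefficients used to generate it are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in data (NOT the Boston values): a positive driver,
# a negative driver, and a variable unrelated to MEDV.
n <- 200
sim <- data.frame(
  RM   = rnorm(n, mean = 6, sd = 0.7),
  CRIM = rexp(n, rate = 0.3),
  NOX  = runif(n, min = 0.4, max = 0.9)
)
sim$MEDV <- 5 * sim$RM - 0.8 * sim$CRIM + rnorm(n, sd = 2)

# Spearman rank correlation of every predictor with MEDV.
predictors <- setdiff(names(sim), "MEDV")
rho <- sapply(predictors, function(v) cor(sim[[v]], sim$MEDV, method = "spearman"))

# Sort by absolute strength so the strongest associations come first.
rho <- rho[order(abs(rho), decreasing = TRUE)]
print(round(rho, 2))
```

Applied to the real data, the same `sapply()` call over the numeric columns of `dataset` would give a quick ranking to compare against the plots.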
par(mfrow=c(1,2))
boxplot(dataset$LON, main = "Longitude")
boxplot(dataset$LAT, main = "Latitude")
# Rows flagged as longitude outliers by the boxplot rule.
outliers <- boxplot(dataset$LON, plot=FALSE)$out
dataset[dataset$LON %in% outliers,]
# Logical indexing keeps every row when nothing is flagged; -which() would drop them all.
dataset <- dataset[!(dataset$LON %in% outliers),]
outliers <- boxplot(dataset$LAT, plot=FALSE)$out
dataset[dataset$LAT %in% outliers,]
dataset <- dataset[!(dataset$LAT %in% outliers),]
par(mfrow=c(1,2))
boxplot(dataset$LON, main = "Longitude")
boxplot(dataset$LAT, main = "Latitude")
Insight
- Outliers are identified with boxplots. Longitude shows outliers, whereas latitude does not. The flagged rows are displayed first and then removed from the dataset.
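For reference, `boxplot()` relies on `boxplot.stats()`, which by default flags values lying more than 1.5 times the hinge spread beyond the hinges. The sketch below applies that rule to a hypothetical toy data frame (`toy`) through a hypothetical helper (`drop_outliers`), and shows why negative indexing with `which()` is risky: when nothing is flagged, the empty index selects zero rows instead of all of them.

```r
# Hypothetical toy data: one obvious longitude outlier, no latitude outliers.
toy <- data.frame(
  LON = c(-71.05, -71.06, -71.04, -71.07, -71.03, -70.50),
  LAT = c(42.20, 42.22, 42.21, 42.23, 42.19, 42.21)
)

# Hypothetical helper: drop rows whose value in `column` is a boxplot outlier.
drop_outliers <- function(df, column) {
  flagged <- boxplot.stats(df[[column]])$out
  df[!(df[[column]] %in% flagged), , drop = FALSE]
}

cleaned <- drop_outliers(toy, "LON")      # removes the -70.50 row
cleaned <- drop_outliers(cleaned, "LAT")  # nothing flagged, rows are kept
print(nrow(cleaned))

# The pitfall with negative which(): an empty index selects zero rows.
no_outliers <- boxplot.stats(cleaned$LAT)$out
emptied <- cleaned[-which(cleaned$LAT %in% no_outliers), ]
print(nrow(emptied))
```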
# https://colorhunt.co/
plot(dataset$LON, dataset$LAT, main = "Charles River in a blue colour", xlab="Longitude", ylab="Latitude", col="#bbbbbb", pch=20, cex= 0.7)
points(dataset$LON[dataset$CHAS==1], dataset$LAT[dataset$CHAS==1], col="#005fba", pch=16)
Insight
- The tract locations show no linear pattern and only a moderate relationship between latitude and longitude.
- The locations highlighted in blue (CHAS = 1) generally lie along the Charles River.
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
fig <- plot_ly(data = dataset, x = ~LAT, y = ~MEDV, marker = list(size = 10, color = '#ffa500'))
fig <- fig %>% layout(title = 'Latitude', yaxis = list(title = "MEDV (thousands of $)", zeroline = TRUE), xaxis = list(title = "", zeroline = TRUE))
fig
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
fig <- plot_ly(data = dataset, x = ~LON, y = ~MEDV, marker = list(size = 10, color = '#005AFF'))
fig <- fig %>% layout(title = 'Longitude', yaxis = list(title = "MEDV (thousands of $)", zeroline = TRUE), xaxis = list(title = "", zeroline = TRUE))
fig
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
#linear model
linear_model = lm(MEDV ~ LAT + LON, data=dataset)
summary(linear_model)
##
## Call:
## lm(formula = MEDV ~ LAT + LON, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.616 -5.677 -1.039 3.807 28.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5481.508 651.768 -8.410 5.27e-16 ***
## LAT 13.983 7.311 1.913 0.0564 .
## LON -69.151 7.214 -9.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.521 on 457 degrees of freedom
## Multiple R-squared: 0.1674, Adjusted R-squared: 0.1638
## F-statistic: 45.95 on 2 and 457 DF, p-value: < 2.2e-16
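Insight
- LON is highly significant (p < 2e-16) with a negative coefficient, while LAT is not significant at the 5% level (p = 0.0564).
- The model explains only about 17% of the variance in MEDV (multiple R-squared 0.1674), so latitude and longitude alone are weak linear predictors of house price.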
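The quantities printed by `summary()` for a linear model can also be pulled out programmatically, which helps when several models are compared. The sketch below fits the same formula to simulated data (`sim`), not the Boston values; the generating coefficients are arbitrary assumptions chosen so that MEDV depends on LON only.

```r
set.seed(42)

# Simulated stand-in data (NOT the Boston values).
n <- 300
sim <- data.frame(
  LAT = rnorm(n, mean = 42.2, sd = 0.05),
  LON = rnorm(n, mean = -71.05, sd = 0.06)
)
# By construction MEDV depends on LON only, plus a lot of noise.
sim$MEDV <- 22 - 60 * (sim$LON + 71.05) + rnorm(n, sd = 8)

fit <- lm(MEDV ~ LAT + LON, data = sim)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- coef(fit_summary)
print(round(coef_table, 4))

# Share of the variance in MEDV explained by the model.
r_squared <- fit_summary$r.squared
print(round(r_squared, 3))

# Terms whose p-value is below 0.05.
p_values <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]
print(significant)
```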
library(rpart)
library(rpart.plot)
tree = rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=dataset)
prp(tree)
Insight
- The tree uses the variables RM, NOX, AGE, CRIM and DIS to predict the corresponding output, MEDV.
- By variable importance, the variables in decreasing order of priority are RM, NOX, CRIM, INDUS, PTRATIO, DIS, TAX, RAD, ZN, LON and CHAS. The first six of them are taken into consideration.
- The approximate values predicted at the leaves are 11, 18, 26, 22, 27, 32 and 46.
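The split variables, importance ranking and leaf predictions mentioned above can all be read from the fitted `rpart` object. The sketch below uses simulated data (`sim`) with a hypothetical `NOISE` column, not the Boston values. In `rpart`, variable importance also credits surrogate splits, which is one reason a variable can rank in the importance list without appearing in the plotted tree.

```r
library(rpart)

set.seed(7)

# Simulated stand-in data (NOT the Boston values).
n <- 400
sim <- data.frame(
  RM    = rnorm(n, mean = 6.3, sd = 0.7),
  NOX   = runif(n, min = 0.4, max = 0.9),
  NOISE = rnorm(n)  # hypothetical column unrelated to MEDV
)
sim$MEDV <- 8 * sim$RM - 20 * sim$NOX + rnorm(n, sd = 2)

tree <- rpart(MEDV ~ RM + NOX + NOISE, data = sim)

# Variables that actually appear in a split ("<leaf>" marks terminal nodes).
split_vars <- setdiff(unique(as.character(tree$frame$var)), "<leaf>")
print(split_vars)

# Importance scores, rescaled to sum to 100.
importance <- tree$variable.importance
print(round(100 * importance / sum(importance), 1))

# Predicted MEDV at each leaf: the values shown at the bottom of the plot.
leaf_values <- tree$frame$yval[tree$frame$var == "<leaf>"]
print(round(sort(leaf_values), 1))
```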